Computation of Substring Probabilities in Stochastic Grammars

Author

  • Ana L. N. Fred
Abstract

The computation of the probability of string generation according to stochastic grammars, given only some of the symbols that compose it, underlies pattern recognition problems concerning prediction and/or recognition based on partial observations. This paper presents algorithms for the computation of substring probabilities in stochastic regular languages. Situations covered include prefix, suffix and island probabilities. The computational time complexity of the algorithms is analyzed.

Introduction

The computation of the probability of string generation according to stochastic grammars, given only some of the symbols that compose it, underlies pattern recognition problems such as the prediction and recognition of patterns based on partial observations. Examples of this type of problem have been described in the context of automatic speech understanding, and in the prediction of a particular physiological state based on the syntactic analysis of electroencephalographic signals. Another potential application, in the area of image understanding, is the recognition of partially occluded objects based on their string contour descriptions.

Expressions for the computation of substring probabilities according to stochastic context-free grammars written in Chomsky Normal Form have been proposed. This paper describes algorithms for the computation of substring probabilities for regular-type languages, expressed by stochastic grammars with productions over non-terminals F_i, F_j ∈ VN, where the start symbol belongs to VN, and VN and Σ denote the non-terminal and terminal symbol sets, respectively. Grammars of this type arise, for instance, in grammatical inference based on Crespi-Reghizzi's method, when structural samples take a bracketed form over sub-patterns such as dc, fgb, e, cd and ab, expressing some sort of temporal alignment of sub-patterns.

The algorithms presented next exploit the particular structure of these grammars, being essentially dynamic programming methods. Situations described include the recognition of fixed-length strings (probability of exact recognition and highest-probability interpretation) and of arbitrary-length strings (prefix, suffix and island probabilities). The computational complexity of the methods is analyzed in terms of worst-case time complexity. Minor and obvious modifications of these algorithms enable the computation of probabilities according to the standard form of regular grammars.

Notation and Definitions

Let G = (VN, Σ, Rs, σ) be a stochastic context-free grammar, where VN is the finite set of non-terminal symbols, or syntactical categories; Σ is the set of terminal symbols (the vocabulary); Rs is a finite set of productions of the form A → α with rule probability p_i, where A ∈ VN and α ∈ (VN ∪ Σ)*, the star representing any combination of symbols in the set; and σ ∈ VN is the start symbol. When the rules take the particular form A → aB or A → a, with A, B ∈ VN and a ∈ Σ, the grammar is designated finite-state, or regular. Throughout the text, the symbols A, G and H will be used to represent non-terminal symbols. The following definitions are also useful:

  • w = w_1 … w_n ≝ a finite sequence of terminal symbols;
  • H ⇒* α ≝ a derivation of α from H through the application of an arbitrary number of rules;
  • C_T(α) ≝ the number of terminal symbols in α, with repetitions;
  • C_N(α) ≝ the number of non-terminal symbols in α, with repetitions;
  • n_min(H, G) ≝ min { C_T(α) : H ⇒* α G }, the minimum number of terminal symbols emitted in a derivation from H that reaches G.

A graphical notation based on derivation trees is used throughout: arcs are associated with the direct application of rules (for instance, the rule H → w_1 w_2 G is drawn as arcs from H to w_1, w_2 and G), and a triangle represents a derivation tree having the top non-terminal symbol as root and leading to the string on the base of the triangle.
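The prefix case admits a compact dynamic-programming sketch. The toy grammar below, including its symbols and probabilities, is an illustrative assumption rather than an example from the paper; rules of the regular form A → aB and A → a are encoded as (terminal, next non-terminal or None, probability) triples.

```python
from collections import defaultdict

# Toy stochastic regular grammar (an illustrative assumption):
# rules[A] = [(terminal, next_nonterminal_or_None, probability), ...]
rules = {
    "S":  [("a", "F1", 0.6), ("b", "F2", 0.4)],
    "F1": [("a", "F1", 0.3), ("b", None, 0.7)],
    "F2": [("b", "F2", 0.5), ("a", None, 0.5)],
}

def prefix_probability(w, start="S"):
    """P(the generated string begins with w), via a forward recursion.

    forward[A] = probability mass of derivations that have emitted
    exactly w[:i] and are currently 'sitting' at non-terminal A.
    """
    forward = defaultdict(float)
    forward[start] = 1.0
    exact = 0.0  # probability of generating exactly w (a special prefix case)
    for i, sym in enumerate(w):
        nxt = defaultdict(float)
        for A, mass in forward.items():
            for term, B, p in rules.get(A, []):
                if term != sym:
                    continue
                if B is None:
                    if i == len(w) - 1:   # terminating rule ends the string at w
                        exact += mass * p
                else:
                    nxt[B] += mass * p
        forward = nxt
    # In a consistent grammar every live non-terminal eventually terminates
    # with total probability 1, so all remaining mass yields extensions of w.
    return exact + sum(forward.values())
```

The suffix and island cases use the same recursion run from every possible starting non-terminal, weighted by the probability of reaching that non-terminal.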


Similar Resources

Precise N-Gram Probabilities from Stochastic Context-Free Grammars

We present an algorithm for computing n-gram probabilities from stochastic context-free grammars, a procedure that can alleviate some of the standard problems associated with n-grams (estimation from sparse data, lack of linguistic structure, among others). The method operates via the computation of substring expectations, which in turn is accomplished by solving systems of linear equations der...
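The substring-expectation idea can be sketched for the simpler regular case. The toy grammar below is an assumption for illustration, and the linear system for expected non-terminal visit counts is solved by fixed-point iteration rather than the direct linear solve the abstract describes for the context-free case.

```python
# Toy stochastic regular grammar (an illustrative assumption, not from the
# paper): rules[A] = [(terminal, next_nonterminal_or_None, probability), ...]
rules = {
    "S":  [("a", "F1", 0.6), ("b", "F2", 0.4)],
    "F1": [("a", "F1", 0.3), ("b", None, 0.7)],
    "F2": [("b", "F2", 0.5), ("a", None, 0.5)],
}

def expected_visits(start="S", iters=200):
    """E[number of expansions of each non-terminal] in a random derivation.

    Satisfies v = e_start + M^T v, with M[A][B] = P(A -> aB for some a);
    solved here by fixed-point iteration instead of a direct linear solve.
    """
    v = {A: 0.0 for A in rules}
    for _ in range(iters):
        nxt = {A: (1.0 if A == start else 0.0) for A in rules}
        for A in rules:
            for _t, B, p in rules[A]:
                if B is not None:
                    nxt[B] += v[A] * p
        v = nxt
    return v

def expected_bigram_count(x, y, start="S"):
    """Expected number of occurrences of the bigram xy in a generated string."""
    v = expected_visits(start)
    emit = lambda A, s: sum(p for t, _B, p in rules[A] if t == s)
    total = 0.0
    for A in rules:
        for t, B, p in rules[A]:
            if t == x and B is not None:  # x emitted, derivation continues at B
                total += v[A] * p * emit(B, y)
    return total
```

Normalizing expected bigram counts by expected unigram counts then yields the n-gram conditional probabilities.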


Computation of the Probability of the Best Derivation of an Initial Substring from a Stochastic Context-Free Grammar

Recently, Stochastic Context-Free Grammars have been considered important for use in Language Modeling for Automatic Speech Recognition tasks [6, 10]. In [6], Jelinek and Lafferty presented and solved the problem of computation of the probability of initial substring generation by using Stochastic Context-Free Grammars. This paper seeks to apply a Viterbi scheme to achieve the computation of th...
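The Viterbi scheme mentioned here replaces the sums of the prefix recursion with maximizations. A minimal sketch for a toy stochastic regular grammar (the grammar and its probabilities are assumptions for illustration, not the paper's context-free construction):

```python
# Toy stochastic regular grammar (an illustrative assumption):
# rules[A] = [(terminal, next_nonterminal_or_None, probability), ...]
rules = {
    "S":  [("a", "F1", 0.6), ("b", "F2", 0.4)],
    "F1": [("a", "F1", 0.3), ("b", None, 0.7)],
    "F2": [("b", "F2", 0.5), ("a", None, 0.5)],
}

def best_completion(iters=100):
    """f[A] = probability of the single best terminating derivation from A,
    computed as a max-product fixed point (the Viterbi analogue of the
    termination probability)."""
    f = {A: 0.0 for A in rules}
    for _ in range(iters):
        f = {A: max(p * (1.0 if B is None else f[B]) for _t, B, p in rules[A])
             for A in rules}
    return f

def viterbi_prefix(w, start="S"):
    """Probability of the best complete derivation whose yield begins with w:
    the sum-based forward recursion with sums replaced by max."""
    f = best_completion()
    best = {start: 1.0}
    exact = 0.0                      # best derivation generating exactly w
    for i, sym in enumerate(w):
        nxt = {}
        for A, score in best.items():
            for term, B, p in rules.get(A, []):
                if term != sym:
                    continue
                if B is None:
                    if i == len(w) - 1:
                        exact = max(exact, score * p)
                else:
                    nxt[B] = max(nxt.get(B, 0.0), score * p)
        best = nxt
    cont = max((score * f[A] for A, score in best.items()), default=0.0)
    return max(exact, cont)
```

Because max distributes over products of probabilities, the recursion has the same time complexity as the summed version.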


Computation of Infix Probabilities for Probabilistic Context-Free Grammars

The notion of infix probability has been introduced in the literature as a generalization of the notion of prefix (or initial substring) probability, motivated by applications in speech recognition and word error correction. For the case where a probabilistic context-free grammar is used as language model, methods for the computation of infix probabilities have been presented in the literature,...
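For intuition, an infix probability can be checked by brute force on a small grammar: sum the probabilities of all strings up to a length bound that contain the infix. The toy stochastic regular grammar below is an illustrative assumption; the enumeration is a truncated lower bound useful for validating smarter algorithms, not the dynamic-programming infix computation itself.

```python
import itertools

# Toy stochastic regular grammar (an illustrative assumption):
# rules[A] = [(terminal, next_nonterminal_or_None, probability), ...]
rules = {
    "S":  [("a", "F1", 0.6), ("b", "F2", 0.4)],
    "F1": [("a", "F1", 0.3), ("b", None, 0.7)],
    "F2": [("b", "F2", 0.5), ("a", None, 0.5)],
}
ALPHABET = "ab"

def string_prob(s, start="S"):
    """Probability of generating exactly the string s (forward recursion)."""
    forward = {start: 1.0}
    exact = 0.0
    for i, sym in enumerate(s):
        nxt = {}
        for A, mass in forward.items():
            for term, B, p in rules.get(A, []):
                if term != sym:
                    continue
                if B is None:
                    if i == len(s) - 1:
                        exact += mass * p
                else:
                    nxt[B] = nxt.get(B, 0.0) + mass * p
        forward = nxt
    return exact

def infix_prob_truncated(w, max_len=14):
    """Truncated lower bound on P(w occurs as an infix of the generated
    string), by enumerating every string up to max_len symbols."""
    return sum(string_prob("".join(t))
               for n in range(1, max_len + 1)
               for t in itertools.product(ALPHABET, repeat=n)
               if w in "".join(t))
```

The enumeration grows exponentially in max_len, which is exactly why closed-form or dynamic-programming methods are needed in practice.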



Computation of the Probability of Initial Substring Generation by Stochastic Context-Free Grammars

Speech recognition language models are based on probabilities P(w_{k+1} = v | w_1 w_2 … w_k) that the next word w_{k+1} will be any particular word v of the vocabulary, given that the word sequence w_1, w_2, …, w_k is hypothesized to have been uttered in the past. If probabilistic context-free grammars are to be used as the basis of the language model, it will be necessary to compute the...
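The language-model conditional reduces to a ratio of two prefix probabilities, P(w_{k+1} = v | w_1..w_k) = Pref(w_1..w_k v) / Pref(w_1..w_k). A self-contained sketch on a toy stochastic regular grammar (the grammar is an assumption for illustration; the paper itself treats the context-free case):

```python
# Toy stochastic regular grammar (an illustrative assumption):
# rules[A] = [(terminal, next_nonterminal_or_None, probability), ...]
rules = {
    "S":  [("a", "F1", 0.6), ("b", "F2", 0.4)],
    "F1": [("a", "F1", 0.3), ("b", None, 0.7)],
    "F2": [("b", "F2", 0.5), ("a", None, 0.5)],
}

def prefix_prob(w, start="S"):
    """P(the generated string begins with w): forward recursion; for a
    consistent grammar all unread mass at live non-terminals eventually
    yields some extension of w."""
    forward = {start: 1.0}
    exact = 0.0
    for i, sym in enumerate(w):
        nxt = {}
        for A, mass in forward.items():
            for term, B, p in rules.get(A, []):
                if term != sym:
                    continue
                if B is None:
                    if i == len(w) - 1:
                        exact += mass * p
                else:
                    nxt[B] = nxt.get(B, 0.0) + mass * p
        forward = nxt
    return exact + sum(forward.values())

def next_symbol_prob(prefix, v):
    """P(w_{k+1} = v | w_1..w_k) as a ratio of prefix probabilities.
    Over all v these may sum to less than 1; the deficit is the relative
    probability that the string terminates exactly at the prefix."""
    return prefix_prob(prefix + v) / prefix_prob(prefix)
```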



Publication date: 2000